22 research outputs found

    Efficient Snapshot Retrieval over Historical Graph Data

    Full text link
    We address the problem of managing historical data for large evolving information networks like social networks or citation networks, with the goal to enable temporal and evolutionary queries and analysis. We present the design and architecture of a distributed graph database system that stores the entire history of a network and provides support for efficient retrieval of multiple graphs from arbitrary time points in the past, in addition to maintaining the current state for ongoing updates. Our system exposes a general programmatic API to process and analyze the retrieved snapshots. We introduce DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compactly recording the historical information, and that supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right parameters for a specific scenario. In addition, we present strategies for materializing portions of the historical graph state in memory to further speed up the retrieval process. Secondly, we present an in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner. We present a comprehensive experimental evaluation that illustrates the effectiveness of our proposed techniques at managing historical graph information

    HISTORICAL GRAPH DATA MANAGEMENT

    Get PDF
    Over the last decade, we have witnessed an increasing interest in temporal analysis of information networks such as social networks or citation networks. Finding temporal interaction patterns, visualizing the evolution of graph properties, or even simply comparing them across time, has proven to add significant value in reasoning over networks. However, because of the lack of underlying data management support, much of the work on large-scale graph analytics to date has largely focused on the study of static properties of graph snapshots. Unfortunately, a static view of interactions between entities is often an oversimplification of several complex phenomena like the spread of epidemics, information diffusion, formation of online communities, and so on. In the absence of appropriate support, an analyst today has to manually navigate the added temporal complexity of large evolving graphs, making the process cumbersome and ineffective. In this dissertation, I address the key challenges in storing, retrieving, and analyzing large historical graphs. In the first part, I present DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compact recording of the historical information, and that supports efficient retrieval of historical graph snapshots. I present analytical models for estimating required storage space and snapshot retrieval times which aid in choosing the right parameters for a specific scenario. I also present optimizations such as partial materialization and columnar storage to speed up snapshot retrieval. In the second part, I present Temporal Graph Index that builds upon DeltaGraph to support version-centric retrieval such as a node’s 1-hop neighborhood history, along with snapshot reconstruction. It provides high scalability, employing careful partitioning, distribution, and replication strategies that effectively deal with temporal and topological skew, typical of temporal graph datasets. In the last part of the dissertation, I present Temporal Graph Analysis Framework that enables analysts to effectively express a variety of complex historical graph analysis tasks using a set of novel temporal graph operators and to execute them in an efficient and scalable manner on a cloud. My proposed solutions are engineered in the form of a framework called the Historical Graph Store, designed to facilitate a wide variety of large-scale historical graph analysis

    Trust in AutoML: Exploring Information Needs for Establishing Trust in Automated Machine Learning Systems

    Full text link
    We explore trust in a relatively new area of data science: Automated Machine Learning (AutoML). In AutoML, AI methods are used to generate and optimize machine learning models by automatically engineering features, selecting models, and optimizing hyperparameters. In this paper, we seek to understand what kinds of information influence data scientists' trust in the models produced by AutoML? We operationalize trust as a willingness to deploy a model produced using automated methods. We report results from three studies -- qualitative interviews, a controlled experiment, and a card-sorting task -- to understand the information needs of data scientists for establishing trust in AutoML systems. We find that including transparency features in an AutoML tool increased user trust and understandability in the tool; and out of all proposed features, model performance metrics and visualizations are the most important information to data scientists when establishing their trust with an AutoML tool.Comment: IUI 202

    Sense-making strategies in explorative intelligence analysis of network evolutions

    Get PDF
    Visualising how social networks evolve is important in intelligence analysis in order to detect and monitor issues, such as emerging crime patterns or rapidly growing groups of offenders. It remains an open research question how this type of information should be presented for visual exploration. To get a sense of how users work with different types of visualisations, we evaluate a matrix and a node-link diagram in a controlled thinking aloud study. We describe the sense-making strategies that users adopted during explorative and realistic tasks. Thereby, we focus on the user behaviour in switching between the two visualisations and propose a set of nine strategies. Based on a qualitative and quantitative content analysis we show which visualisation supports which strategy better. We find that the two visualisations clearly support intelligence tasks and that for some tasks the combined use is more advantageous than the use of an individual visualisation
    corecore